Take-home Exercise 2

Published

May 29, 2023

VAST Challenge: Mini-Challenge 2

Project Brief

The country of Oceanus has sought FishEye International’s help in identifying companies possibly engaged in illegal, unreported, and unregulated (IUU) fishing. Using import/export trade data, FishEye’s analysts hope to understand business relationships, including finding links that will help them stop fishy IUU activities and protect marine species that are affected by it.

FishEye knows from past experience that companies caught fishing illegally will often shut down then start up again under a different name. Visualising these temporal patterns could thus help in comparing the activities of companies over time to determine if the companies have returned to their nefarious acts.

Project Objectives

This study thus aims to visualise temporal patterns for individual entities and between entities from trade records, and to categorise the types of business relationship patterns found through the analysis, paying particular attention to abnormalities in order to detect fishy trade activities.

%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'primaryColor': '#93c7c2',
      'primaryTextColor': '#fff',
      'primaryBorderColor': '#3d7670',
      'lineColor': '#3d7670',
      'secondaryColor': '#3d7670',
      'tertiaryColor': '#fff'
    }
  }
}%%

flowchart LR
    A{fa:fa-fish-fins \nOverall\nnetwork} -->|patterns?| B(Time)
    B -.- C((?))
    

1: Data Preparation

pacman::p_load(jsonlite, tidyverse, DT, lubridate, Hmisc, urbnthemes,
               visNetwork, tidygraph, ggraph, ggiraph, igraph, scales, ggplot2, 
               gganimate, ggstatsplot, ggrain, ggridges, graphlayouts, plotly, patchwork,
               kableExtra, ggpubr, ggrepel)

Use the jsonlite package to read the .json file:

mc2 <- jsonlite::fromJSON("data/mc2_challenge_graph.json")
glimpse(mc2)
List of 5
 $ directed  : logi TRUE
 $ multigraph: logi TRUE
 $ graph     : Named list()
 $ nodes     :'data.frame': 34576 obs. of  4 variables:
  ..$ shpcountry: chr [1:34576] "Polarinda" NA "Oceanus" NA ...
  ..$ rcvcountry: chr [1:34576] "Oceanus" NA "Oceanus" NA ...
  ..$ dataset   : chr [1:34576] "MC2" "MC2" "MC2" "MC2" ...
  ..$ id        : chr [1:34576] "AquaDelight Inc and Son's" "BaringoAmerica Marine Ges.m.b.H." "Yu gan  Sea spray GmbH Industrial" "FlounderLeska Marine BV" ...
 $ links     :'data.frame': 5464378 obs. of  9 variables:
  ..$ arrivaldate     : chr [1:5464378] "2034-02-12" "2034-03-13" "2028-02-07" "2028-02-23" ...
  ..$ hscode          : chr [1:5464378] "630630" "630630" "470710" "470710" ...
  ..$ valueofgoods_omu: num [1:5464378] 141015 141015 NA NA NA ...
  ..$ volumeteu       : num [1:5464378] 0 0 0 0 0 0 0 0 0 0 ...
  ..$ weightkg        : int [1:5464378] 4780 6125 10855 11250 11165 11290 9000 19490 6865 19065 ...
  ..$ dataset         : chr [1:5464378] "MC2" "MC2" "MC2" "MC2" ...
  ..$ source          : chr [1:5464378] "AquaDelight Inc and Son's" "AquaDelight Inc and Son's" "AquaDelight Inc and Son's" "AquaDelight Inc and Son's" ...
  ..$ target          : chr [1:5464378] "BaringoAmerica Marine Ges.m.b.H." "BaringoAmerica Marine Ges.m.b.H." "-15045" "-15045" ...
  ..$ valueofgoodsusd : num [1:5464378] NA NA NA NA NA ...

The JSON file contains two dataframes:

  • nodes
    • 34576 observations of 4 variables
    • shpcountry, rcvcountry, dataset and id
  • links
    • 5,464,378 observations of 9 variables
    • arrivaldate, hscode, valueofgoods_omu, volumeteu, weightkg, dataset, source, target, valueofgoodsusd

Notes:
- volumeteu: TEU (Twenty-foot Equivalent Unit) is a unit of measurement of shipping volume, used to determine cargo capacity for container ships and terminals.
- hscode: (Harmonized System Codes) is a standardized numerical method of identifying traded products.

1.1: Splitting data into nodes and edges

mc2_nodes <- as_tibble(mc2$nodes) %>%
  # reorder dataframe columns
  select(id, rcvcountry, shpcountry) %>%
  arrange(id)

mc2_links <- as_tibble(mc2$links) %>%
  # Extract year from date and save as factor column
  mutate(year = as_factor(year(arrivaldate))) %>%
  
  # Move Source and Target to the front
  select(source, target, hscode, year, arrivaldate, weightkg, volumeteu, valueofgoods_omu, valueofgoodsusd)

1.2: Data Health

Diagnostic checks to survey the extent of “incompleteness” of the data:

I. Checking for Missing Values (nodes):

# Check for columns with missing values
colSums(is.na(mc2_nodes))
        id rcvcountry shpcountry 
         0       2909      22359 

rcvcountry has 2,909 missing values and shpcountry has 22,359; incomplete records are more common for the shipping origin of trades. To facilitate analysis, all NA values are replaced with “Unknown”:

mc2_nodes <- mc2_nodes %>%
  replace(is.na(mc2_nodes), "Unknown")

II. Checking for Duplicates (nodes):

# Check for duplicated rows
mc2_nodes[duplicated(mc2_nodes),]
# A tibble: 0 × 3
# ℹ 3 variables: id <chr>, rcvcountry <chr>, shpcountry <chr>

There are no duplicated entries in the nodes tibble.

III. Assigning unique identifier:

As the company names are long and may be difficult to use as labels, a numerical unique identifier (uuid) is assigned to each company:

# Add unique id for each company
mc2_nodes$uuid <- seq.int(nrow(mc2_nodes))

I. Checking for Missing Values (links):

# Check for columns with missing values
colSums(is.na(mc2_links))
          source           target           hscode             year 
               0                0                0                0 
     arrivaldate         weightkg        volumeteu valueofgoods_omu 
               0                0           520933          5464097 
 valueofgoodsusd 
         3017844 

valueofgoods_omu has too many missing values and would unnecessarily skew aggregated values. While some missing values could reflect intentional non-reporting of value, and thus indicate possible IUU activity, there are too many to filter on this criterion alone. This column will be dropped from the analysis, while the relationships between weightkg, volumeteu and valueofgoodsusd will be examined more closely for possible discrepancies.
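As a minimal sketch of the drop, performed on a copy so the original tibble is untouched (`mc2_links_clean` and `value_per_kg` are illustrative names, not part of the original analysis):

```r
# Sketch: drop the sparse column on a copy (mc2_links_clean is a hypothetical name)
mc2_links_clean <- mc2_links %>%
  select(-valueofgoods_omu)

# Illustrative consistency check: implied USD value per kg, where both are reported
mc2_links_clean %>%
  filter(!is.na(valueofgoodsusd), weightkg > 0) %>%
  mutate(value_per_kg = valueofgoodsusd / weightkg) %>%
  summarise(min = min(value_per_kg),
            median = median(value_per_kg),
            max = max(value_per_kg))
```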

II. Checking for Duplicates (links):

# Check for duplicated rows
mc2_links[duplicated(mc2_links),]
# A tibble: 155,291 × 9
   source    target hscode year  arrivaldate weightkg volumeteu valueofgoods_omu
   <chr>     <chr>  <chr>  <fct> <chr>          <int>     <dbl>            <dbl>
 1 French C… Mar d… 851629 2028  2028-03-09        20         0               NA
 2 French C… Mar d… 851629 2028  2028-06-02       110         0               NA
 3 French C… -1992  852990 2028  2028-01-02      2830         0               NA
 4 French C… -1992  852990 2028  2028-01-02      2830         0               NA
 5 French C… -1992  852990 2028  2028-01-04      2805         0               NA
 6 French C… -1992  852990 2028  2028-01-11      1895         0               NA
 7 French C… -1992  852990 2028  2028-01-11      2790         0               NA
 8 French C… -1992  852990 2028  2028-01-11      2790         0               NA
 9 French C… -1992  852990 2028  2028-01-18      1895         0               NA
10 French C… -1992  852990 2028  2028-01-18      2790         0               NA
# ℹ 155,281 more rows
# ℹ 1 more variable: valueofgoodsusd <dbl>

There are 155,291 duplicated transactions. As these may be indicative of possible fishy activity, the duplicated rows are saved separately as mc2_links_dup.

# Save duplicated transactions in a separate tibble
mc2_links_dup <- mc2_links[duplicated(mc2_links),]

# Leave only unique transactions in original edges tibble
mc2_links <- unique(mc2_links)

2: Exploring Fishy Data Patterns

From the Summary Statistics and preliminary visualisations in Section 1.4, traderoutes will be investigated based on abnormalities in temporal trends across number of trades and export weight. The analysis will focus on visualising networks based on filtered records:

%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'primaryColor': '#93c7c2',
      'primaryTextColor': '#fff',
      'primaryBorderColor': '#3d7670',
      'lineColor': '#3d7670',
      'secondaryColor': '#3d7670',
      'tertiaryColor': '#fff'
    }
  }
}%%

flowchart LR
    A{fa:fa-fish-fins \nOverall\nnetwork} ==> B((Country\nLevel))
    A ==> C((Company\nLevel))
    C -.- D{fa:fa-calendar-days \nTime} -->|patterns?| E(trade frequency)
    B -.- D -->|patterns?| F(trade weight)
    D -->|patterns?| G(trade value\n$usd)

2.1 Country Level Trade Networks

To plot a network graph detailing links between countries, new links and nodes dataframes have to be created from the existing data.

country_links <- mc2_links_agg %>%
  # Match source company to nodes to get shipping origin country
  inner_join(mc2_nodes, by = join_by("source" == "id")) %>%
  # Use shpcountry as source, rcvcountry as target for links
  select(shpcountry, rcvcountry, year, yearmonth, weight, weightkg) %>%
  filter(shpcountry != rcvcountry)
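`mc2_links_agg` above is the aggregated edge list prepared in the earlier sections (not shown in this excerpt). A minimal sketch of how such an aggregation could be built, assuming trades are counted per route and month (column names follow those used in this section):

```r
# Sketch of the assumed earlier aggregation (not shown in this excerpt):
# weight = number of trades per route and month, weightkg = total shipped weight
mc2_links_agg <- mc2_links %>%
  mutate(yearmonth = floor_date(as.Date(arrivaldate), "month")) %>%
  group_by(source, target, year, yearmonth) %>%
  summarise(weight = n(),
            weightkg = sum(weightkg),
            .groups = "drop")
```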

2.1.1: Investigating 2030 data

From the summary statistics outlined in Section 1, an abnormal increase in trade counts was observed in March 2030. Trade records from 2030 are thus filtered for investigation:

country_links_2030 <- country_links %>%
  filter(year == "2030") %>%
  group_by(shpcountry, rcvcountry, yearmonth) %>%
  summarise(weight = sum(weight),
            weightkg = sum(weightkg)) %>%
  filter(shpcountry != rcvcountry) %>%
  ungroup()
distinct_shpcountry <- country_links_2030 %>%
    distinct(shpcountry) %>%
    mutate(country = shpcountry) %>%
    select(country)

distinct_rcvcountry <- country_links_2030 %>%
    distinct(rcvcountry) %>%
    mutate(country = rcvcountry) %>%
    select(country)

country_nodes_2030 <- unique(rbind(distinct_shpcountry, distinct_rcvcountry))
country_graph_2030 <- tbl_graph(nodes = country_nodes_2030,
                          edges = country_links_2030, 
                          directed = TRUE)
country_graph_2030
# A tbl_graph: 137 nodes and 4439 edges
#
# A directed multigraph with 1 component
#
# A tibble: 137 × 1
  country   
  <chr>     
1 -22004    
2 -22005    
3 -22007    
4 Alverossia
5 Alverovia 
6 Andenovia 
# ℹ 131 more rows
#
# A tibble: 4,439 × 5
   from    to yearmonth  weight weightkg
  <int> <int> <date>      <int>    <dbl>
1     1    75 2030-01-01      6   580805
2     1    75 2030-02-01      9  1732215
3     1    75 2030-03-01      2   252950
# ℹ 4,436 more rows

Step 4: Visualise country trade network for 2030:

# Measure directed out-degree centrality and save as a column
V(country_graph_2030)$out_degree <- degree(country_graph_2030, mode = "out")

set.seed(1234)
g2030 <- country_graph_2030 %>%
  activate(edges) %>%
  mutate(yearmonth = factor(format(yearmonth, "%b"),
                            levels = c("Jan", "Feb", "Mar", "Apr", 
                            "May", "Jun", "Jul", "Aug", 
                            "Sep", "Oct", "Nov", "Dec"))) %>%
  ggraph(layout = "linear",
         circular = TRUE) +
  geom_edge_fan(
    aes(width = weight,
        color = yearmonth),
    alpha = .6,
    arrow = arrow(length = unit(2, 'mm')),
    show.legend = FALSE
  ) +
  scale_edge_width(
    range = c(0.1,4)
  ) +
  geom_node_point(
    aes(size = out_degree),
    color = "grey20",
    alpha = .7,
    show.legend = FALSE
  ) +
  geom_node_text(
    aes(label = ifelse(out_degree > quantile(out_degree, .75), country, "")), 
    size = 3,
    repel = TRUE
  ) +
  theme(
    plot.title = element_text(size = 20,
                              color = "grey20"),
    legend.title = element_text(),
    plot.background = element_rect(fill="#F8F3E6",colour="#F8F3E6")
  ) +
  transition_states(yearmonth,
                    transition_length = 6,
                    state_length = 4
  ) +
  labs(
    title = "Country-level Trade Network in {closest_state} 2030",
    subtitle = "Larger nodes have higher Out-Degree Centrality. Countries with the most export routes are identified: "
  ) +
  enter_fade() +
  exit_fade()

g2030

Insights from Network Graph:

No visually fishy patterns are identified in March 2030 compared to the other months. Links with thicker widths represent a higher number of traderoutes, and these have largely remained the same throughout the year. This suggests that the abnormal increase could be detected through a company-level rather than a country-level network analysis. However, more information is first gathered from country-level traderoute activities to sieve out possible flags for IUU activity.

2.1.2: Visualising Records Across the Years

Which countries have the heaviest trade activities? Plotting an overall network graph would result in a dense web of links with little analytical value. To detect fishy temporal patterns, countries that have higher traderoute counts are used for the network analysis.

# Keep the top 1,000 links by weight (traderoute count)
country_links_top <- top_n(country_links, 1000, weight)
# Getting unique country names from links data
distinct_shpcountry_all <- country_links_top %>%
    distinct(shpcountry) %>%
    mutate(country = shpcountry) %>%
    select(country)

distinct_rcvcountry_all <- country_links_top %>%
    distinct(rcvcountry) %>%
    mutate(country = rcvcountry) %>%
    select(country)

# Creating nodes list from combined links data
country_nodes <- unique(rbind(distinct_shpcountry_all, distinct_rcvcountry_all))

# Create Graph Object
country_graph <- tbl_graph(nodes = country_nodes,
                          edges = country_links_top, 
                          directed = TRUE)
country_graph
# A tbl_graph: 24 nodes and 1005 edges
#
# A directed acyclic multigraph with 2 components
#
# A tibble: 24 × 1
  country   
  <chr>     
1 Marebak   
2 Rio Isla  
3 Kondanovia
4 Alverossia
5 Merigrad  
6 Utoporiana
# ℹ 18 more rows
#
# A tibble: 1,005 × 6
   from    to year  yearmonth  weight weightkg
  <int> <int> <fct> <date>      <int>    <dbl>
1     1    17 2028  2028-01-01    435  7607660
2     1    17 2028  2028-01-01    199  3353530
3     2    18 2028  2028-01-01    307 24667540
# ℹ 1,002 more rows

I. Out-degree centrality (Visualising top Exports)

# Measure directed out-degree centrality and save as a column
V(country_graph)$out_degree <- degree(country_graph, mode = "out")

set.seed(1234)
g_out <- country_graph %>%
  ggraph(layout = "kk"
  )+
  geom_edge_link(
    aes(width = weight,
        color = year),
    alpha = .6,
    arrow = arrow(length = unit(2, 'mm')),
    show.legend = FALSE
  ) +
  scale_edge_width(
    range = c(0.1,4)
  ) +
  geom_node_point(
    aes(size = out_degree),
    color = "grey20",
    show.legend = FALSE
  ) +
  geom_node_text(
    aes(label = ifelse(out_degree > quantile(out_degree, .5), country, "")), 
    size = 3,
    nudge_x = .5
  ) +
  theme(
    plot.title = element_text(size = 20,
                              color = "grey20"),
    legend.title = element_text(),
    plot.background = element_rect(fill="#F8F3E6",colour="#F8F3E6")
  ) +
  transition_states(year,
                    transition_length = 3,
                    state_length = 4
  ) +
  labs(
    title = "Country Trade Network for Year {closest_state}",
    subtitle = "Larger nodes have higher Out-Degree Centrality"
  ) +
  enter_fade() +
  exit_fade()

g_out

II. In-degree centrality (Visualising top Imports)

# Measure directed in-degree centrality and save as a column
V(country_graph)$in_degree <- degree(country_graph, mode = "in")

set.seed(1234)
g_in <- country_graph %>%
  ggraph(layout = "kk"
  )+
  geom_edge_link(
    aes(width = weight,
        color = year),
    alpha = .6,
    arrow = arrow(length = unit(2, 'mm')),
    show.legend = FALSE
  ) +
  scale_edge_width(
    range = c(0.1,4)
  ) +
  geom_node_point(
    aes(size = in_degree),
    color = "grey20",
    show.legend = FALSE
  ) +
  geom_node_text(
    aes(label = ifelse(in_degree > quantile(in_degree, .5), country, "")), 
    size = 3,
    nudge_x = .5
  ) +
  theme(
    plot.title = element_text(size = 20,
                              color = "grey20"),
    legend.title = element_text(),
    plot.background = element_rect(fill="#F8F3E6",colour="#F8F3E6")
  ) +
  transition_states(year,
                    transition_length = 3,
                    state_length = 4
  ) +
  labs(
    title = "Country Trade Network for Year {closest_state}",
    subtitle = "Larger nodes have higher In-Degree Centrality. Countries with the most imports are identified:"
  ) +
  enter_fade() +
  exit_fade()

g_in

III. Betweenness Centrality

# Calculate betweenness centrality and save values to new column
country_graph <- country_graph %>%
  mutate(betweenness = centrality_betweenness())

set.seed(1234)
g_bc <- country_graph %>%
  ggraph(
    layout = "kk") +
  geom_edge_arc(
    aes(width = weight,
        color = year),
    alpha = .6,
    arrow = arrow(length = unit(2, 'mm')),
    show.legend = FALSE
  ) +
  scale_edge_width(
    range = c(0.1,4)
  ) +
  geom_node_point(
    aes(size = betweenness),
    color = "grey20"
  ) +
  geom_node_text(
    aes(label = ifelse(betweenness > quantile(betweenness, .5), country, "")), 
    size = 3,
    nudge_y = .3
  ) +
  theme(
    plot.title = element_text(size = 20,
                              color = "grey20"),
    legend.position = "bottom",
    legend.title = element_text(),
    plot.background = element_rect(fill="#F8F3E6",colour="#F8F3E6")
  ) +
  transition_states(year,
                    transition_length = 3,
                    state_length = 4
  ) +
  labs(
    title = "Country Trade Network for Year {closest_state}",
    subtitle = "Larger nodes have higher Betweenness Centrality"
  ) +
  enter_fade() +
  exit_fade()

g_bc

IV. Eigenvector Centrality

# Calculate eigenvector centrality and save values to new column
ec <- eigen_centrality(country_graph)$vector
V(country_graph)$eigencentrality <- ec

set.seed(1234)
g_ec <- country_graph %>%
  ggraph(layout = "kk") +
  geom_edge_arc(
    aes(width = weight,
        color = year),
    alpha = .6,
    arrow = arrow(length = unit(2, 'mm')),
    show.legend = FALSE
  ) +
  scale_edge_width(
    range = c(0.1,4)
  ) +
  geom_node_point(
    aes(size = eigencentrality),
    color = "grey30"
  ) +
  geom_node_text(
    aes(label = ifelse(eigencentrality > quantile(eigencentrality, .5), country, "")), 
    size = 3,
    nudge_y = -.1
  ) +
  theme(
    plot.title = element_text(size = 20,
                              color = "grey20"),
    legend.position = "bottom",
    legend.title = element_text(),
    plot.background = element_rect(fill="#F8F3E6",colour="#F8F3E6")
  ) +
  transition_states(year,
                    transition_length = 3,
                    state_length = 4
  ) +
  labs(
    title = "Country Trade Network for Year {closest_state}",
    subtitle = "Larger nodes have higher Eigenvector Centrality"
  ) +
  enter_fade() +
  exit_fade()

g_ec

V. PageRank Score

# Calculate PageRank Score and save values to new column
pr <- page_rank(country_graph)$vector
V(country_graph)$pagerank <- pr

set.seed(1234)
g_pr <- country_graph %>%
  ggraph(layout = "linear",
         circular = TRUE) +
  geom_edge_arc(
    aes(width = weight,
        color = year),
    alpha = .5,
    arrow = arrow(length = unit(2, 'mm')),
    show.legend = FALSE
  ) +
  scale_edge_width(
    range = c(0.1,4)
  ) +
  geom_node_point(
    aes(size = pagerank),
    color = "grey30"
  ) +
  geom_node_text(
    aes(label = country), 
    size = 3,
    nudge_y = -.05
  ) +
  theme(
    plot.title = element_text(size = 20,
                              color = "grey20"),
    legend.title = element_text(),
    plot.background = element_rect(fill="#F8F3E6",colour="#F8F3E6")
  ) +
  transition_states(year,
                    transition_length = 5,
                    state_length = 5
  ) +
  labs(
    title = "Country Trade Network for {closest_state}",
    subtitle = "Larger nodes have higher PageRank Score"
  ) +
  enter_fade() +
  exit_fade()

g_pr

2.1.3: Comparing Distribution of Centrality Measures

Data Preparation:

# Retrieve Centrality measures from nodes
country_centrality <- data.frame(
  OutDegree = V(country_graph)$out_degree,
  InDegree = V(country_graph)$in_degree,
  Betweenness_centrality = V(country_graph)$betweenness,
  Eigenvector_centrality = V(country_graph)$eigencentrality,
  PageRank_score = V(country_graph)$pagerank
)

# Create function to transform variables to same scale (min-max normalisation)
transform_variable <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# function to apply scaling across all variables in a dataframe
transform_dataframe <- function(df) {
  df %>%
    mutate(across(where(is.numeric) & !matches("uuid"), transform_variable))
}

# Apply function and save to new dataframe
country_centrality_scaled <- transform_dataframe(country_centrality)

# Pivot longer to get centrality measures as factors
country_centrality_long <- gather(country_centrality_scaled, key = "CentralityMeasure", value = "CentralityScore")
# Density ridges to show distribution of data
ggplot(
    country_centrality_long, 
    aes(x = CentralityScore, 
        y = CentralityMeasure, 
        fill = CentralityMeasure,
        color = CentralityMeasure)
  ) +
  geom_density_ridges(
    alpha = .6,
    scale = 3
  ) +
  geom_rug() +
  labs(
    title = "Similar Distribution of Values Across Centrality Measures",
    subtitle = "Right-skewed distribution with presence of outliers",
    x = "Centrality Score",
    y = "Centrality Measure"
  ) +
  theme(
    legend.position = "none",
    axis.title = element_blank(),
    panel.grid.major = element_blank(),
    plot.background = element_rect(fill="#F8F3E6",colour="#F8F3E6")
  )
Picking joint bandwidth of 0.0303

Insights from Network Graphs:

2.2 Company Level Trade Networks: Top Traderoutes by Frequency

%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'primaryColor': '#93c7c2',
      'primaryTextColor': '#fff',
      'primaryBorderColor': '#3d7670',
      'lineColor': '#3d7670',
      'secondaryColor': '#3d7670',
      'tertiaryColor': '#fff'
    }
  }
}%%

flowchart LR
    A{fa:fa-fish-fins \nOverall\nnetwork} ==> B((Country\nLevel))
    A ==> C((Company\nLevel)) -.-|filter| B
    C -.- D{fa:fa-calendar-days \nTime} -->|patterns?| E(trade frequency)
    B -.- D -->|patterns?| F(trade weight)
    D -->|patterns?| G(trade value\n$usd)

# Extract centrality metrics from country graph and save into new data frame
country_filter <- data.frame(
  country = V(country_graph)$country,
  OutDegree = V(country_graph)$out_degree,
  InDegree = V(country_graph)$in_degree,
  Betweenness_centrality = V(country_graph)$betweenness,
  Eigenvector_centrality = V(country_graph)$eigencentrality,
  PageRank_score = V(country_graph)$pagerank
)

# Define function to filter each variable by percentile
percentile_filter <- function(x) {
  x >= quantile(x, .75)
}

# Filter the dataframe to retrieve list of countries 
country_filter <- country_filter %>%
  filter(percentile_filter(OutDegree) |
           percentile_filter(InDegree) |
           percentile_filter(Betweenness_centrality) |
           percentile_filter(Eigenvector_centrality) |
           percentile_filter(PageRank_score)
          )
mc2_nodes_filtered <- mc2_nodes %>%
  filter(shpcountry %in% country_filter$country | 
           rcvcountry %in% country_filter$country)

Step 3: Aggregate the frequency of traderoutes in the filtered countries

routes_by_count <- mc2_links_agg %>%
  # Ensure that companies in the dataframe are from top trading countries 
  filter(source %in% mc2_nodes_filtered$id |
           target %in% mc2_nodes_filtered$id) %>%
  group_by(source, target) %>%
  filter(source != target) %>%
  # Get count of trade route
  summarise(count = sum(weight)) %>%
  # Arrange in descending order to get top routes first
  arrange(desc(count)) %>%
  ungroup()

datatable(routes_by_count)
Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html

Insights from Table:
There seem to be many common source and target companies appearing across the various traderoutes. This suggests that IUU fishing activity could be detected by visualising this information as a network and using centrality measures to determine key players. Since companies caught fishing illegally often shut down but start up again under a different name, identifying fishy companies by comparing their trading partners over the years could help determine whether they have returned to illegal acts.
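As a sketch of such a comparison, the export-partner sets of two companies can be scored with a Jaccard-style overlap; a high overlap between a defunct company and a newly appeared one could flag a re-registration. The helper below is illustrative and not part of the original analysis:

```r
# Illustrative helper: Jaccard overlap between the export-partner sets of two companies
partner_overlap <- function(links, company_a, company_b) {
  partners_a <- unique(links$target[links$source == company_a])
  partners_b <- unique(links$target[links$source == company_b])
  length(intersect(partners_a, partners_b)) / length(union(partners_a, partners_b))
}

# Hypothetical usage: compare a shut-down company with a newly registered one
# partner_overlap(mc2_links, "Old Company Ltd", "New Company Ltd")
```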

2.2.1: What are abnormally large tradecounts?

source_count <- routes_by_count %>%
  group_by(source) %>%
  summarise(sourcecount = n()) %>%
  arrange(desc(sourcecount)) %>%
  ungroup()
 
target_count <- routes_by_count %>%
  group_by(target) %>%
  summarise(targetcount = n()) %>%
  arrange(desc(targetcount)) %>%
  ungroup()
medsource <- median(source_count$sourcecount)
qsource <- quantile(source_count$sourcecount, probs = .95)

distinctsource <- n_distinct(routes_by_count$source)
distincttarget <- n_distinct(routes_by_count$target)

source <- ggplot(source_count, 
       aes(1, 
           y = sourcecount)
  ) +
  geom_rain(
    alpha = .7,
    boxplot.args = list(
              color = "grey20",
              fill = "salmon",
              outlier.shape = NA),
    violin.args = list(alpha = .6,
                       fill = "salmon")
  ) +
  scale_y_continuous(
    limits = c(0, 1500),
    breaks = scales::pretty_breaks(n = 5)
  ) +
# Add annotation for median and 95th percentile value
  annotate(
    geom = "text",
    x = 1.3,
    y = 500,
    label = paste0("Median count: ", medsource, "  |  ",
                   "95th Percentile count: ", qsource, "  |  ",
                   "No. distinct:", distinctsource)
  ) +
  labs(
    title = "I. Distribution of traderoute counts per Source Company"
  ) + 
  theme(
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    plot.background = element_rect(fill="#F8F3E6",colour="#F8F3E6")
  ) + 
  coord_flip() 

medtarget <- median(target_count$targetcount)
qtarget <- quantile(target_count$targetcount, probs = .95)

target <- ggplot(target_count, 
       aes(1, 
           y = targetcount)
  ) +
  geom_rain(
    alpha = .7,
    boxplot.args = list(
              color = "grey20",
              fill = "#1696d2",
              outlier.shape = NA),
    violin.args = list(alpha = .6,
                       fill = "#1696d2")
  ) +
  scale_y_continuous(
    limits = c(0, 1500),
    breaks = scales::pretty_breaks(n = 5)
  ) +
  annotate(
    geom = "text",
    x = 1.3,
    y = 500,
    label = paste0("Median count: ", medtarget, "  |  ",
                   "95th Percentile count: ", qtarget, "  |  ",
                   "No. distinct:", distincttarget)
  ) +
   annotate(
    geom = "text",
    x = 0.9,
    y = 1200,
    label = "Presence of many Outliers with high traderoute counts"
  ) +
  labs(
    title = "II. Distribution of traderoute counts per Target Company"
  ) + 
  theme(
    axis.title.y = element_blank(),
    axis.title.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    plot.background = element_rect(fill="#F8F3E6",colour="#F8F3E6")
  ) + 
  coord_flip() 

countpatch <- (source / target) +
              plot_annotation(title = "Highly Right-skewed Distributions of Traderoute Counts",
                              theme = theme(plot.title = element_text(size = 16)))
countpatch & theme(plot.background = element_rect(fill="#F8F3E6",colour="#F8F3E6"))

Insights from Distribution of traderoute counts:

The raincloud plots of traderoute counts reveal outliers with values far above the median and the 95th percentile. This suggests that abnormal fishing activity could be investigated by filtering for traderoutes with very high frequency. Using countries as filters, together with companies beyond the 95th percentile of traderoute counts, could thus give some insight into IUU fishing networks.
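A minimal sketch of the percentile filter described above (`source_outliers` is an illustrative name):

```r
# Sketch: source companies whose traderoute counts exceed the 95th percentile
source_outliers <- source_count %>%
  filter(sourcecount > quantile(sourcecount, .95)) %>%
  arrange(desc(sourcecount))
```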

2.3: Who are the Top players in the Trade Network by Tradecount?

links_sorted <- mc2_links_agg %>%
  # Only keep companies in suspicious countries
   filter(source %in% mc2_nodes_filtered$id |
           target %in% mc2_nodes_filtered$id) %>%
  
  # Group and calculate total number of tradecounts per year
  group_by(source, target, year) %>%
  summarise(weight = sum(weight)) %>%
  
  # Arrange data in order of year and descending order of tradecounts to get top
  arrange(year, desc(weight)) %>%
  ungroup()
  
# Filter out top 30 per year
top_30 <- links_sorted %>%
  group_by(year) %>%
  # Get top 30 companies by weight
  top_n(30, weight) %>%
  ungroup()
# Create nodes dataframe for companies in the top 30 traderoutes per year
ttdistinct_source <- top_30 %>%
  distinct(source)

ttdistinct_target <- top_30 %>%
  distinct(target)

# Select only overlapping nodes and distinct companies from links
ttnodes_source <- inner_join(
    ttdistinct_source, mc2_nodes,
    by = c("source" = "id")) %>%
  mutate(id = source)

ttnodes_target <- inner_join(
    ttdistinct_target, mc2_nodes,
    by = c("target" = "id")) %>%
  mutate(id = target)

# Create new nodes data by combining filtered dataframes
top_30_nodes <- bind_rows(ttnodes_source, ttnodes_target) %>%
  left_join(source_count, by = c("id" = "source")) %>%
  select(id, uuid, shpcountry, rcvcountry, sourcecount)
top_30_graph <- tbl_graph(nodes = top_30_nodes,
                          edges = top_30, 
                          directed = TRUE)
top_30_graph
# A tbl_graph: 86 nodes and 211 edges
#
# A directed acyclic multigraph with 15 components
#
# A tibble: 86 × 5
  id                                     uuid shpcountry rcvcountry  sourcecount
  <chr>                                 <int> <chr>      <chr>             <int>
1 Coastal Cruisers Pic Shipping         24448 Marebak    Oceanus             214
2 Marine Masterminds Dry dock           28774 Marebak    Coralmarica          88
3 Mar y Luna Sagl                       28702 Marebak    Coralmarica         309
4 Ianira Starfish Sagl Import           26454 Rio Isla   Sol y Ocea…          61
5 Orange River   GmbH & Co. KG Shipping 30325 Marebak    Oceanus             189
6 Olas del Sur Sea                      30295 Rio Isla   Coralmarica          67
# ℹ 80 more rows
#
# A tibble: 211 × 4
   from    to year  weight
  <int> <int> <fct>  <int>
1     1    46 2028    5816
2     2    47 2028    5509
3     3    30 2028    4822
# ℹ 208 more rows
# Measure directed out-degree centrality and save as a column
V(top_30_graph)$out_degree <- degree(top_30_graph, mode = "out")

set.seed(1234)
ttg_out <- ggraph(top_30_graph, 
    layout = "linear"
  ) +
  geom_edge_arc(
    aes(width = weight,
        color = year), 
    alpha = .6,
    arrow = arrow(length = unit(2, 'mm')),
    show.legend = FALSE
  ) +
  scale_edge_width(
    range = c(0.1, 4)
  ) +
  geom_node_point(
    color = "grey30",
    aes(size = out_degree)
  ) +
  geom_node_text(
    aes(label = ifelse(out_degree > quantile(out_degree, .75), uuid, "")), 
    size = 3,
    repel = TRUE
  ) +
  guides(
    color = "none"
  ) +
  theme(
    plot.title = element_text(size = 20,
                              colour = "#3A3B60"),
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.title = element_text(),
    legend.box = "vertical",
    plot.background = element_rect(fill="#F8F3E6",colour="#F8F3E6")
  ) +
  transition_states(year,
                    transition_length = 3,
                    state_length = 2
  ) +
  labs(
    title = 'Year: {closest_state}',
    subtitle = "Top Exporters by Degree Centrality >> \nLarger nodes represent higher out-degree Centrality, thicker links show higher frequency of trade"
  ) +
  enter_fade() +
  exit_fade()

ttg_out

In the plot above, node sizes are based on out-degree centrality: larger nodes have a higher number of outgoing links. This enables us to visually determine which companies have more exporting links over time. The plots below repeat the exercise with in-degree, betweenness, eigenvector, and PageRank centrality instead:
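As a minimal, self-contained illustration of how igraph's `degree()` splits into out- and in-degree on a directed graph (a toy graph, not the trade network):

```r
library(igraph)

# Toy directed graph: A exports to B and C; B exports to C
g <- graph_from_literal(A -+ B, A -+ C, B -+ C)

degree(g, mode = "out")  # A = 2, B = 1, C = 0
degree(g, mode = "in")   # A = 0, B = 1, C = 2
```

Setting `mode = "all"` (the default) would instead sum both directions, which is why the exporter and importer plots are computed separately.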

# Measure directed in-degree centrality and save as a column
V(top_30_graph)$in_degree <- degree(top_30_graph, mode = "in")

set.seed(1234)
ttg_in <- ggraph(top_30_graph, 
    layout = "linear"
  ) +
  geom_edge_arc(
    aes(width = weight,
        color = year), 
    alpha = .6,
    arrow = arrow(length = unit(2, 'mm')),
    show.legend = FALSE
  ) +
  scale_edge_width(
    range = c(0.1, 4)
  ) +
  geom_node_point(
    color = "grey30",
    aes(size = in_degree)
  ) +
  geom_node_text(
    aes(label = ifelse(in_degree > quantile(in_degree, .75), uuid, "")), 
    size = 3,
    repel = TRUE
  ) +
  guides(
    color = "none"
  ) +
  theme(
    plot.title = element_text(size = 20,
                              colour = "#3A3B60"),
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.title = element_text(),
    legend.box = "vertical",
    plot.background = element_rect(fill="#F8F3E6",colour="#F8F3E6")
  ) +
  transition_states(year,
                    transition_length = 3,
                    state_length = 2
  ) +
  labs(
    title = 'Year: {closest_state}',
    subtitle = "Top Importers by Degree Centrality >> \nLarger nodes represent higher in-degree Centrality, thicker links show higher frequency of trade"
  ) +
  enter_fade() +
  exit_fade()

ttg_in

# Calculate betweenness centrality and save values to new column
top_30_graph <- top_30_graph %>%
  mutate(betweenness = centrality_betweenness())

set.seed(1234)
ttg2 <- 
  ggraph(top_30_graph,
         layout = "kk"
  ) +
  geom_edge_link(
    aes(width = weight,
        color = year), 
        alpha = .6,
    arrow = arrow(length = unit(2, 'mm')),
    show.legend = FALSE
  ) +
  scale_edge_width(
    range = c(0.1, 4)
  ) +
  geom_node_point(
    color = "grey30",
    aes(size = betweenness)
  ) +
  geom_node_text(
  # Only show labels of top 75th percentile by betweenness centrality
    aes(label = ifelse(betweenness > quantile(betweenness, .75), uuid, "")),
    nudge_x = .7,
    size = 3
  ) +
  theme(
    plot.title = element_text(size = 20,
                              color = "#3A3B60"),
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.title = element_text(),
    plot.background = element_rect(fill="#F8F3E6",colour="#F8F3E6")
  ) +
  transition_states(year,
                    transition_length = 3,
                    state_length = 2
  ) +
  labs(
    title = "Year: {closest_state}",
    subtitle = "Top Exporters by Betweenness Centrality >> \nLarger nodes represent higher Betweenness Centrality, thicker links show higher frequency of trade"
  ) +
  enter_grow() +
  exit_fade()

ttg2

# Calculate eigenvector centrality and save values to new column
ec <- eigen_centrality(top_30_graph)$vector
V(top_30_graph)$eigencentrality <- ec

set.seed(1234)
ttg3 <- 
  ggraph(top_30_graph,
         layout = "kk"
  ) +
  geom_edge_link(
    aes(width = weight,
        color = year), 
        alpha = .6,
    arrow = arrow(length = unit(2, 'mm')),
    show.legend = FALSE
  ) +
  scale_edge_width(
    range = c(0.1, 4)
  ) +
  geom_node_point(
    color = "grey30",
    aes(size = eigencentrality)
  ) +
  geom_node_text(
  # Only show labels of nodes above the 75th percentile by eigenvector centrality
    aes(label = ifelse(eigencentrality > quantile(eigencentrality, .75), uuid, "")),
    nudge_x = .7,
    size = 3
  ) +
  theme(
    plot.title = element_text(size = 20,
                              color = "#3A3B60"),
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.title = element_text(),
    plot.background = element_rect(fill="#F8F3E6",colour="#F8F3E6")
  ) +
  transition_states(year,
                    transition_length = 3,
                    state_length = 2
  ) +
  labs(
    title = "Year: {closest_state}",
    subtitle = "Company Network with Eigenvector Centrality >> \nLarger nodes represent higher Eigenvector Centrality, thicker links show higher frequency of trade"
  ) +
  enter_grow() +
  exit_fade()

ttg3

# Calculate pageRank Score and save values to new column
pr <- page_rank(top_30_graph)$vector
V(top_30_graph)$pagerank <- pr

set.seed(1234)
ttg4 <- 
  ggraph(top_30_graph,
         layout = "kk"
  ) +
  geom_edge_link(
    aes(width = weight,
        color = year), 
        alpha = .6,
    arrow = arrow(length = unit(2, 'mm')),
    show.legend = FALSE
  ) +
  scale_edge_width(
    range = c(0.1, 4)
  ) +
  geom_node_point(
    color = "grey30",
    aes(size = pagerank)
  ) +
  geom_node_text(
  # Only show labels of nodes above the 75th percentile by PageRank score
    aes(label = ifelse(pagerank > quantile(pagerank, .75), uuid, "")),
    nudge_x = .7,
    size = 3
  ) +
  theme(
    plot.title = element_text(size = 20,
                              color = "#3A3B60"),
    legend.position = "bottom",
    legend.direction = "horizontal",
    legend.title = element_text(),
    plot.background = element_rect(fill="#F8F3E6",colour="#F8F3E6")
  ) +
  transition_states(year,
                    transition_length = 3,
                    state_length = 2
  ) +
  labs(
    title = "Year: {closest_state}",
    subtitle = "Company Network with PageRank Score >> \nLarger nodes represent higher Scores, thicker links show higher frequency of trade"
  ) +
  enter_grow() +
  exit_fade()

ttg4

2.3.1 : Visualising distribution of Centrality Measures

Data Preparation:

# Retrieve Centrality measures from nodes
top30_centrality <- data.frame(
  uuid = V(top_30_graph)$uuid,
  OutDegree = V(top_30_graph)$out_degree,
  InDegree = V(top_30_graph)$in_degree,
  Betweenness_centrality = V(top_30_graph)$betweenness,
  Eigenvector_centrality = V(top_30_graph)$eigencentrality,
  PageRank_score = V(top_30_graph)$pagerank
)
all(degree(top_30_graph) == top30_centrality$OutDegree + top30_centrality$InDegree)
[1] TRUE
ggplot(top30_centrality,
       aes(x = InDegree,
           y = OutDegree,
           label = uuid)
  ) +  
  geom_abline(intercept = 0, 
              slope = 1,
              color = "salmon",
              alpha = .6
  ) +
  ylim(0, 25) +
  xlim(0, 25) +
  geom_point(
    aes(color = ifelse(OutDegree == 0 & InDegree == 0, "grey40",
                       ifelse(OutDegree > 0 & InDegree <= 0, "#1696d2", "#fdbf11")
                       )
        ),
    show.legend = FALSE
  ) +
  # Use the colour strings literally rather than mapping them to a palette
  scale_color_identity() +
  geom_label_repel(aes(label = uuid),
                  box.padding   = 0.35
  ) +
  annotate(
    geom = "text",
    x = 8,
    y = 20,
    label = "Net Exporters",
    size = 8,
    color = "grey70"
  ) +
  annotate(
    geom = "text",
    x = 20,
    y = 8,
    label = "Net Importers",
    size = 8,
    color = "grey70"
  ) +
  theme(plot.background = element_rect(fill="#F8F3E6",colour="#F8F3E6"))

# Apply function to normalize values in dataframe
top30_centrality_scaled <- top30_centrality %>%
    select(-uuid)
    
top30_centrality_scaled <- transform_dataframe(top30_centrality_scaled)

# Pivot longer to get centrality measures as factors
top30_centrality_long <- gather(top30_centrality_scaled, key = "CentralityMeasure", value = "CentralityScore")
# Density ridges to show distribution of data
ggplot(
    top30_centrality_long, 
    aes(x = CentralityScore, 
        y = CentralityMeasure, 
        fill = CentralityMeasure,
        color = CentralityMeasure)
  ) +
  geom_density_ridges(
    alpha = .6,
    scale = 2
  ) +
  geom_rug() +
  annotate(
    geom = "text",
    x = .7,
    y = 4.5,
    label = "Highly right-skewed distribution with smaller peaks"
  ) +
  labs(
    title = "Similar Distribution of Values Across Centrality Measures",
    x = "Centrality Score",
    y = "Centrality Measure"
  ) +
  theme(
    legend.position = "none",
    axis.title = element_blank(),
    panel.grid.major = element_blank(),
    plot.background = element_rect(fill="#F8F3E6",colour="#F8F3E6")
  )
Picking joint bandwidth of 0.0279

Insights from Network Graphs and Centrality Measure Distributions:

From the network graphs, we can see that the key nodes remain unchanged for the entire time period (2028-2034), but the heavier trade routes (identified by thicker links) from these key nodes vary through the years. This could be indicative of IUU activity with fishy trading partners. The distribution graphs show heavily right-skewed distributions across all centrality measures. This enables us to filter companies with out-degree centrality > 10 and betweenness centrality > 10 to identify the top exporters by trade count.
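That filtering step can be sketched as below; the dataframe here is an illustrative stand-in with the same column names as `top30_centrality` from section 2.3.1:

```r
library(dplyr)

# Illustrative stand-in for top30_centrality (the real one is built from the graph)
demo <- data.frame(
  uuid = c("A", "B", "C"),
  OutDegree = c(15, 4, 12),
  Betweenness_centrality = c(20, 30, 5)
)

# Keep companies that are high on both export links and brokerage role
demo %>%
  filter(OutDegree > 10,
         Betweenness_centrality > 10)
# keeps only "A"
```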

2.4: Who are the Top 30 exporters from Oceanus by export weight per Year?

Insights from Network Graphs and Centrality Measure Distributions:

Comparing the distribution of centrality measures, there are significantly more nodes with high degree but low betweenness centrality. This suggests that there are clusters of nodes in the network, and that these nodes are well-connected within their clusters but not to the rest of the nodes in other clusters. This could be indicative of various smaller IUU networks operating within the overall network.
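This degree-vs-betweenness pattern can be reproduced on a toy graph, unrelated to the actual trade data: two cliques joined through a single bridge vertex give interior clique members high degree but zero betweenness, while the bridge carries all cross-cluster shortest paths.

```r
library(igraph)

# Two 4-node cliques (vertices 1-4 and 5-8) joined through bridge vertex 9
g <- make_full_graph(4) %du% make_full_graph(4)
g <- add_vertices(g, 1)
g <- add_edges(g, c(1, 9, 5, 9))

degree(g)[c(2, 9)]       # interior clique member: 3, bridge: 2
betweenness(g)[c(2, 9)]  # interior clique member: 0, bridge: 16
```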

2.5: What are the most heavily traded products?

Using hscode (product code), we can filter the most frequently traded products by number of transactions, total export weight, as well as monetary value.

hscode_links <- mc2_links %>%
  filter(source != target) %>%
  group_by(hscode, year) %>%
  summarise(
    weight = n(),
    weightkg = sum(weightkg),
    value_usd = sum(valueofgoodsusd)) %>%
  ungroup()

hscode_weight <- top_n(hscode_links, 3, weight) 

kbl(hscode_weight,
    caption = "Top Traded Products by Frequency") %>%
 kable_styling(
   bootstrap_options = "striped", 
   full_width = F,
   position = "float_left")
Top Traded Products by Frequency

| hscode | year | weight | weightkg  | value_usd  |
|--------|------|--------|-----------|------------|
| 306170 | 2032 | 26242  | 504304550 | 5469832600 |
| 306170 | 2033 | 27125  | 521920430 | NA         |
| 306170 | 2034 | 27046  | 523579660 | NA         |

Interestingly, the most frequently traded product across the data's time range is hscode 306170, which tops the table in the consecutive years 2032-2034.

hscode_weightkg <- top_n(hscode_links, 3, weightkg) 

kbl(hscode_weightkg,
    caption = "Top Traded Products by Export Weight") %>%
 kable_styling(
   bootstrap_options = "striped", 
   full_width = F,
   position = "float_right")
Top Traded Products by Export Weight

| hscode | year | weight | weightkg   | value_usd |
|--------|------|--------|------------|-----------|
| 721049 | 2030 | 1933   | 2681588855 | NA        |
| 721049 | 2032 | 1492   | 9454054095 | NA        |
| 721049 | 2033 | 1897   | 5538412100 | NA        |
hscode_value <- top_n(hscode_links, 3, value_usd) 

kbl(hscode_value,
    caption = "Top Traded Products by Monetary Value") %>%
 kable_styling(
   bootstrap_options = "striped", 
   full_width = F,
   position = "float_left")
Top Traded Products by Monetary Value

| hscode | year | weight | weightkg | value_usd    |
|--------|------|--------|----------|--------------|
| 293339 | 2034 | 51     | 16699830 | 224473283155 |
| 293399 | 2032 | 53     | 14513980 | 226041275060 |
| 293410 | 2030 | 133    | 2526805  | 113052776765 |

::: {.panel-tabset}

3: Categorizing fishy companies

The network analysis metrics in section 2 are used to derive attributes of each fishy company:

|             | sus_tradecount | sus_tradeweight | sus_score |
|-------------|----------------|-----------------|-----------|
| Type        | Bool (0/1)     | Bool (0/1)      | int       |
| Description | T if degree AND betweenness centrality values in top quartile | T if degree OR betweenness centrality value in top quartile | Sum of both ‘sus_’ flags |
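A sketch of how these flags could be derived with dplyr, assuming the `top30_centrality` dataframe from section 2.3.1; `deg_q3` and `btw_q3` are hypothetical helper names for the 75th-percentile thresholds, not part of the original code:

```r
library(dplyr)

# Hypothetical 75th-percentile cut-offs for total degree and betweenness
deg_q3 <- quantile(top30_centrality$OutDegree +
                   top30_centrality$InDegree, .75)
btw_q3 <- quantile(top30_centrality$Betweenness_centrality, .75)

sus_flags <- top30_centrality %>%
  mutate(
    degree_total    = OutDegree + InDegree,
    # T if degree AND betweenness are both in the top quartile
    sus_tradecount  = as.integer(degree_total > deg_q3 &
                                 Betweenness_centrality > btw_q3),
    # T if degree OR betweenness is in the top quartile
    sus_tradeweight = as.integer(degree_total > deg_q3 |
                                 Betweenness_centrality > btw_q3),
    # Sum of both flags
    sus_score       = sus_tradecount + sus_tradeweight
  )
```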
source_top <- source_count %>%
  filter(sourcecount > 100)

target_top <- target_count %>%
  filter(targetcount > 100)

mc2_links_top <- mc2_links_agg %>%
  semi_join(target_top, by = join_by(target)) %>%
  semi_join(source_top, by = join_by(source))

2.3: Yearly transactions for top trade routes by frequency of trades

links_top_year <- mc2_links_top %>%
  group_by(
    source, target, year) %>%
  summarise(
  # Number of active months in the year
    months_active = n(),
  # Total number of transactions per year
    weight = sum(weight),
  # Avg weight of trade per month
    avg_weightkg = round((sum(weightkg)/months_active),0)) %>%
  ungroup()
`summarise()` has grouped output by 'source', 'target'. You can override using
the `.groups` argument.
datatable(links_top_year)
Warning in instance$preRenderHook(instance): It seems your data is too big for
client-side DataTables. You may consider server-side processing:
https://rstudio.github.io/DT/server.html
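One way to avoid that warning in a static document is to trim the table before handing it to DT (true server-side processing requires a Shiny runtime); a minimal sketch, assuming `links_top_year` as built above:

```r
library(dplyr)
library(DT)

# Show only the 500 heaviest trade routes client-side
links_top_year %>%
  slice_max(weight, n = 500) %>%
  datatable(options = list(pageLength = 10))
```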
# filter distinct source and target from mc2_edges
distinct_source <- links_top_year %>%
    distinct(source)

distinct_target <- links_top_year %>%
    distinct(target)

# Select only overlapping nodes and distinct companies from links
nodes_source <- inner_join(
    distinct_source, mc2_nodes,
    by = c("source" = "id")) %>%
  mutate(id = source)

nodes_target <- inner_join(
    distinct_target, mc2_nodes,
    by = c("target" = "id")) %>%
  mutate(id = target)

# Create new nodes data by combining filtered dataframes
nodes_new <- bind_rows(nodes_source, nodes_target) %>%
  select(id, shpcountry, rcvcountry)
# Combine edges and nodes to a graph dataframe
links_top_vis <- links_top_year %>% 
  rename(
    "from" = "source", 
    "to" = "target") %>%
  select(from, to, year, months_active, weight, avg_weightkg)
nodes_vis <- nodes_new %>%
  rename("group" = "rcvcountry")
visNetwork(nodes_vis, links_top_vis,
          height = "500px", width = "100%") %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visGroups(groupname = "Oceanus", shape = "icon", 
            icon = list(code = "f043", color = "deepskyblue4")) %>%
  visGroups(groupname = "Unknown", shape = "icon", 
            icon = list(code = "f059", color = "darkslategray4")) %>%
  addFontAwesome() %>%
  visLegend() %>%
  visEdges(arrows = "to") %>%
  visOptions(highlightNearest = list(enabled = T, degree = 2, hover = T),
             nodesIdSelection = TRUE,
             selectedBy = "shpcountry",
             collapse = TRUE)